Abstract

We¹ develop the concept of ABC-Boost (Adaptive Base Class Boost) for multi-class classification and
present ABC-MART, a concrete implementation of ABC-Boost. The original MART (Multiple Additive Regression
Trees) algorithm has been very successful in large-scale applications. For binary classification, ABC-MART
recovers MART. For multi-class classification, ABC-MART considerably improves MART, as evaluated on several
public data sets.

1 Introduction

Classification is a basic task in machine learning. A training data set $\{y_i, \mathbf{x}_i\}_{i=1}^N$ consists of $N$ feature vectors
(samples) $\mathbf{x}_i$ and $N$ class labels $y_i$, $i = 1$ to $N$. Here $y_i \in \{0, 1, 2, \ldots, K - 1\}$ and $K$ is the number of classes.
The task is to predict the class labels. This study focuses on multi-class classification ($K \geq 3$).

Many classification algorithms are based on boosting[7, 2, 3, 5, 4], which is regarded as one of the most
significant breakthroughs in machine learning. MART[4] (Multiple Additive Regression Trees) is a successful boosting
algorithm, especially for large-scale applications in industry practice. For example, the regression-based
ranking method developed at Yahoo![1, 9] used an underlying learning algorithm based on MART. McRank[6], the
classification-based ranking method, also used MART as the underlying learning procedure.

This study proposes ABC-Boost (Adaptive Base Class Boost) for multi-class classification. We present
ABC-MART, a concrete implementation of ABC-Boost. ABC-Boost is based on the following two key ideas:

1. For multi-class classification, popular loss functions for $K$ classes usually assume a constraint[5, 4, 8] such
that only the values for $K - 1$ classes are needed. Therefore, we can choose a base class and derive
algorithms only for $K - 1$ classes.

2. At each boosting step, although the base class is not explicitly trained, it will implicitly benefit from the
training on $K - 1$ classes, due to the constraint. Thus, we adaptively choose the base class which has the
"worst" performance.

The idea of assuming a constraint on the loss function and using a base class may not be at all surprising. For
binary ($K = 2$) classification, a "sum-to-zero" constraint on the loss function is automatically considered so that
we only need to train the algorithm for one (instead of $K = 2$) class. For multi-class ($K \geq 3$) classification,
the sum-to-zero constraint on the loss function is also ubiquitously adopted[5, 4, 8]. In particular, the multi-class
Logitboost[5] algorithm was derived by explicitly averaging over $K - 1$ base classes.



¹ The first draft was submitted to IEEE ICDM on July 07, 2008. Although the submission was not accepted, the author thanks two of
the reviewers for their informative comments. In particular, the author appreciates that one of the reviewers considered this work "will be one
of the state of the art." In this revision, we followed that reviewer's suggestion by adding more experiments.
The loss function adopted in our ABC-MART is the same as in MART[4] and Logitboost[5]. All three
algorithms assume the "sum-to-zero" constraint. However, the first and second derivatives of the loss
function that we obtain differ from those used in MART[4] and Logitboost[5]. See Section 3 for details.

In terms of implementation, our proposed ABC-MART differs from the original MART algorithm only in
a few lines of code. Since MART is known to be a successful algorithm, much of our work is devoted to the
empirical comparisons of ABC-MART with MART. Our experiment results on publicly available data sets will
demonstrate that ABC-MART can considerably improve MART. Also, ABC-MART reduces both the training
and testing time by 1/K, which may be quite beneficial when K is small.

We notice that data sets in industry applications are often quite large (e.g., several million samples[6]). Pub- 
licly available data sets (e.g., UCI repository), however, are mostly small. In our study, the Covertype data set 
from the UCI repository is reasonably large with 581,012 observations. 

We first review the original MART algorithm and functional gradient boosting[4]. 

2 Review MART and Functional Gradient Boosting 

The MART algorithm is the marriage of functional gradient boosting and regression trees. Given the training
data set $\{y_i, \mathbf{x}_i\}_{i=1}^N$ and a loss function $L$, [4] adopted a "greedy stagewise" approach to build an additive function
$F^{(M)}$, which is a sum of $M$ terms,

$$F^{(M)}(\mathbf{x}) = \sum_{m=1}^{M} \rho_m h(\mathbf{x}; \mathbf{a}_m), \qquad (1)$$

such that, at each stage $m$, $m = 1$ to $M$,

$$\{\rho_m, \mathbf{a}_m\} = \arg\min_{\rho, \mathbf{a}} \sum_{i=1}^{N} L\left(y_i, F^{(m-1)}(\mathbf{x}_i) + \rho h(\mathbf{x}_i; \mathbf{a})\right). \qquad (2)$$

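Here the function $h(\mathbf{x}; \mathbf{a})$ is the "base learner" or "weak learner." In general, (2) is still a difficult optimization
problem. [4] approximately conducted steepest descent in the function space, by solving a least-squares problem

$$\mathbf{a}_m = \arg\min_{\mathbf{a}, \rho} \sum_{i=1}^{N} \left[-g_m(\mathbf{x}_i) - \rho h(\mathbf{x}_i; \mathbf{a})\right]^2, \qquad (3)$$

where

$$-g_m(\mathbf{x}_i) = -\left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F(\mathbf{x}) = F^{(m-1)}(\mathbf{x})} \qquad (4)$$

is the negative gradient (the steepest descent direction) in the $N$-dimensional data space at $F^{(m-1)}(\mathbf{x})$. For
obtaining another coefficient $\rho_m$, a line search is performed:

$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\left(y_i, F^{(m-1)}(\mathbf{x}_i) + \rho h(\mathbf{x}_i; \mathbf{a}_m)\right). \qquad (5)$$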

A generic "gradient boosting" algorithm is described in Alg. 1, for any differentiable loss function L. For 
multi-class classification, [4] proposed MART, which implemented Line 5 in Alg. 1 by regression trees and Line 
7 by a one-step Newton update within each terminal node of the trees. 
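The stagewise procedure in (1) to (5) can be illustrated with a minimal sketch. The choices below are assumptions made only for illustration and are not the setup of [4] or of this paper: squared-error loss (so the negative gradient is the residual), one-dimensional inputs, regression stumps as the base learner, and a shrinkage factor borrowed from MART. The names `fit_stump`, `predict_stump`, and `gradient_boost` are hypothetical.

```python
def fit_stump(x, target):
    """Least-squares regression stump on 1-D inputs (needs >= 2 distinct x values).

    Returns (threshold, left_value, right_value)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = 0.5 * (lo + hi)
        left = [v for xi, v in zip(x, target) if xi <= t]
        right = [v for xi, v in zip(x, target) if xi > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((v - lv) ** 2 for v in left) + sum((v - rv) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def predict_stump(stump, xi):
    t, lv, rv = stump
    return lv if xi <= t else rv


def gradient_boost(x, y, M=50, shrinkage=0.1):
    """Generic gradient boosting for the loss (y - F)^2 / 2."""
    F0 = sum(y) / len(y)                 # best constant fit for squared loss
    F = [F0] * len(y)
    stumps = []
    for _ in range(M):
        # Negative gradient of the squared loss is the residual, cf. (4).
        residual = [yi - fi for yi, fi in zip(y, F)]
        stump = fit_stump(x, residual)   # least-squares fit, cf. (3)
        h = [predict_stump(stump, xi) for xi in x]
        denom = sum(v * v for v in h)
        # Closed-form line search for squared loss, cf. (5).
        rho = sum(r * v for r, v in zip(residual, h)) / denom if denom > 0 else 0.0
        F = [fi + shrinkage * rho * v for fi, v in zip(F, h)]
        stumps.append((rho, stump))
    return F0, stumps, F


x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 0, 1, 1, 1, 1]
F0, stumps, F = gradient_boost(x, y)
mse = sum((yi - fi) ** 2 for yi, fi in zip(y, F)) / len(y)
print(F0, mse)
```

MART replaces the stump with a J-terminal node regression tree and the line search with a one-step Newton update per terminal node, as described above.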

2.1 The Loss Function and Multiple Logistic Probability Model in MART 

For multi-class classification, MART adopted the following negative multinomial log-likelihood loss, which is
also the loss function in Logitboost[5]:

$$L = \sum_{i=1}^{N} L_i = \sum_{i=1}^{N} \left\{ -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k} \right\}, \qquad (6)$$
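Algorithm 1 A generic gradient boosting algorithm [4, Alg. 1].

1: $F_{\mathbf{x}} = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho)$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K - 1$ Do
4:     $\tilde{y}_i = -\left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F(\mathbf{x}) = F^{(m-1)}(\mathbf{x})}$, $i = 1$ to $N$.
5:     $\mathbf{a}_m = \arg\min_{\mathbf{a}, \rho} \sum_{i=1}^{N} \left[\tilde{y}_i - \rho h(\mathbf{x}_i; \mathbf{a})\right]^2$
6:     $\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\left(y_i, F^{(m-1)}(\mathbf{x}_i) + \rho h(\mathbf{x}_i; \mathbf{a}_m)\right)$
7:     $F_{\mathbf{x}} = F_{\mathbf{x}} + \rho_m h(\mathbf{x}; \mathbf{a}_m)$
8:   End
9: End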



where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise. Apparently, $\sum_{k=0}^{K-1} r_{i,k} = 1$ for any $i$. Here $p_{i,k}$ is the
probability that the $i$th observation belongs to class $k$:

$$p_{i,k} = \Pr(y_i = k \mid \mathbf{x}_i). \qquad (7)$$

MART adopted the following logistic probability model[4]

$$F_{i,k} = \log p_{i,k} - \frac{1}{K} \sum_{s=0}^{K-1} \log p_{i,s}, \qquad (8)$$

or equivalently[4],

$$p_{i,k} = \frac{\exp(F_{i,k})}{\sum_{s=0}^{K-1} \exp(F_{i,s})}. \qquad (9)$$

Apparently, the model (8) implies $\sum_{k=0}^{K-1} F_{i,k} = 0$, the sum-to-zero constraint. In fact, since $\sum_{k=0}^{K-1} p_{i,k} = 1$,
the model only has $K - 1$ degrees of freedom. Some constraint on $F_{i,k}$ is necessary in order to obtain a unique
solution. For binary ($K = 2$) classification, the sum-to-zero constraint is automatically enforced.
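The relationship between (8) and (9) can be checked numerically. The following is a minimal sketch; the function names `softmax` and `centered_logits` and the example probabilities are illustrative assumptions, not part of the original algorithm description.

```python
import math


def softmax(F):
    """Class probabilities from function values, as in model (9)."""
    m = max(F)                      # subtract the max for numerical stability
    e = [math.exp(f - m) for f in F]
    total = sum(e)
    return [v / total for v in e]


def centered_logits(p):
    """Function values from class probabilities, as in model (8)."""
    logs = [math.log(v) for v in p]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


p = [0.5, 0.3, 0.2]                 # hypothetical probabilities for K = 3
F = centered_logits(p)
print(sum(F))                       # numerically zero: the sum-to-zero constraint
print(softmax(F))                   # recovers p up to rounding
```

Mapping probabilities through (8) and back through (9) returns the same probabilities, and the resulting function values sum to zero.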

2.2 The Original MART Algorithm 

Alg. 2 describes the MART algorithm for multi-class classification using negative multinomial log-likelihood loss 
(6) and multi-class logistic model (9). 

Algorithm 2 MART[4, Alg. 6]. Note that in Line 6, the term $p_{i,k}(1 - p_{i,k})$ replaces the equivalent term
$|r_{i,k} - p_{i,k}|\,(1 - |r_{i,k} - p_{i,k}|)$ in [4, Alg. 6].

0: $r_{i,k} = 1$, if $y_i = k$, and $r_{i,k} = 0$ otherwise.
1: $F_{i,k} = 0$, $k = 0$ to $K - 1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K - 1$ Do
4:     $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$
5:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal node regression tree from $\{r_{i,k} - p_{i,k}, \mathbf{x}_i\}_{i=1}^{N}$
6:     $\beta_{j,k,m} = \frac{K-1}{K} \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} (r_{i,k} - p_{i,k})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} (1 - p_{i,k})\, p_{i,k}}$
7:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
8:   End
9: End



MART follows the generic paradigm of functional gradient boosting (Alg. 1). At each stage $m$, MART solves
the least-squares problem (Line 5 in Alg. 1) by regression trees. MART builds $K$ regression trees at each boosting
step.

We elaborate in more detail on several key components of MART.

2.2.1 The Functional Gradient (Pseudo Response) 



MART performs gradient descent in the function space, using the gradient evaluated at the function values. For
the $i$th data point, using the negative multinomial log-likelihood loss (6), i.e.,

$$L_i = -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k} \qquad (10)$$

and the probability model (9), [4] showed

$$\frac{\partial L_i}{\partial F_{i,k}} = -\left(r_{i,k} - p_{i,k}\right). \qquad (11)$$

This explains the term $r_{i,k} - p_{i,k}$ in Line 5 of Alg. 2.

2.2.2 The Second Derivative and One-Step Newton Update 

While only the first derivatives were used for building the structure of the trees, MART used the second derivatives
to update the values of the terminal nodes by a one-step Newton procedure. [4] showed

$$\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}\left(1 - p_{i,k}\right), \qquad (12)$$

which explains Line 6 in Alg. 2.



(11) and (12) were also derived in the Logitboost[5]. However, in this paper we actually obtain different first 
and second derivatives. 
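The derivatives (11) and (12) treat the $K$ function values as free variables, which can be confirmed by finite differences. The sketch below is only a numerical sanity check under that assumption; the function names and the example values are hypothetical.

```python
import math


def softmax(F):
    m = max(F)
    e = [math.exp(f - m) for f in F]
    total = sum(e)
    return [v / total for v in e]


def loss(F, y):
    """Negative multinomial log-likelihood (10) for one observation with label y."""
    return -math.log(softmax(F)[y])


def mart_derivatives(F, y, k):
    """Closed forms (11) and (12), with all K function values unconstrained."""
    p = softmax(F)
    r = 1.0 if y == k else 0.0
    return -(r - p[k]), p[k] * (1.0 - p[k])


def numeric_derivatives(F, y, k, h=1e-4):
    """Central finite differences for the first and second derivative in F_k."""
    up = list(F)
    up[k] += h
    dn = list(F)
    dn[k] -= h
    f0, fu, fd = loss(F, y), loss(up, y), loss(dn, y)
    return (fu - fd) / (2.0 * h), (fu - 2.0 * f0 + fd) / (h * h)


F = [0.2, -0.5, 0.9]   # hypothetical function values, K = 3
y = 2                  # hypothetical class label
for k in range(3):
    print(k, mart_derivatives(F, y, k), numeric_derivatives(F, y, k))
```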



2.2.3 K Trees for K Classes 

For each $i$, there are $K$ function values, $F_{i,k}$, $k = 0$ to $K - 1$, and consequently $K$ gradients. MART builds $K$
regression trees at each boosting step. Apparently, the constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$ will not hold.

Note that one can actually re-center the $F_{i,k}$ at the end of every boosting step so that the sum-to-zero constraint
is satisfied after training. That is, one can insert a line

$$F_{i,k} \leftarrow F_{i,k} - \frac{1}{K} \sum_{s=0}^{K-1} F_{i,s} \qquad (13)$$

after Line 8 in Alg. 2 to make $\sum_{k=0}^{K-1} F_{i,k} = 0$. However, we observe that this re-centering step makes no
difference in our experiments. We believe the sum-to-zero constraint should be enforced before the training
instead of after the training, at every boosting step.
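One observation consistent with the finding that re-centering makes no difference: the probabilities in (9) are unchanged when the same constant is subtracted from every function value, so the pseudo responses $r_{i,k} - p_{i,k}$ are unchanged as well. A minimal sketch, with hypothetical function values:

```python
import math


def softmax(F):
    m = max(F)
    e = [math.exp(f - m) for f in F]
    total = sum(e)
    return [v / total for v in e]


def recenter(F):
    """Apply (13): subtract the mean so that the values sum to zero."""
    mean = sum(F) / len(F)
    return [f - mean for f in F]


F = [1.2, -0.4, 0.9]             # hypothetical values that do not sum to zero
F_centered = recenter(F)
print(sum(F_centered))           # numerically zero
print(softmax(F))
print(softmax(F_centered))       # same probabilities up to rounding
```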

The case $K = 2$ is an exception. Because the two pseudo responses, $r_{i,k=0} - p_{i,k=0}$ and $r_{i,k=1} - p_{i,k=1}$, are
identical with signs flipped (i.e., $\sum_{k=0}^{1} F_{i,k} = 0$ automatically holds), there is no need to build $K = 2$ trees. In
fact, [4] presented the binary classification algorithm separately [4, Alg. 5], which is the same as [4, Alg. 6] by
letting $K = 2$, although the presentations were somewhat different.

Line 6 of Alg. 2 contains a factor $\frac{K-1}{K}$. For binary classification, it is clear that the factor $\frac{1}{2}$ comes from the
mathematical derivation. For $K \geq 3$, we believe this factor in a way approximates the constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$.
Because MART builds $K$ trees while there are only $K - 1$ degrees of freedom, a factor $\frac{K-1}{K}$ may help reduce the
influence.



2.2.4 Three Main Parameters: $J$, $\nu$, and $M$

Practitioners like MART partly because the algorithm has only a few parameters, which are not very
sensitive as long as they fall in some "reasonable" range. It is often fairly easy to identify the (close to) optimal
parameters with limited tuning. This is a huge advantage, especially for large data sets.

The number of terminal nodes, $J$, determines the capacity of the base learner. [4] suggested that $J = 6$ is often
a good choice.

The shrinkage parameter, $\nu$, should be large enough to make sufficient progress at each boosting step and
small enough to avoid over-fitting. Also, a very small $\nu$ may require a large number of boosting steps, which may
be a practical concern for real-world applications if the training and/or testing is time-consuming. [4] suggested
$\nu \leq 0.1$.

The number of boosting steps, $M$, in a sense is largely determined by the computing time one can afford. A
commonly-regarded merit of boosting is that over-fitting can be largely avoided for reasonable $J$ and $\nu$ and hence
one might simply let $M$ be as large as possible. However, for small data sets, the training loss (6) may reach the
machine accuracy before $M$ becomes very large.



3 ABC-Boost and ABC-MART

We re-iterate the two key components in developing ABC-Boost (Adaptive Base Class Boost):

1. Using the popular constraint in the loss function, we can choose a base class and derive the boosting
algorithm for only $K - 1$ classes.

2. At each boosting step, we can adaptively choose the base class which has the "worst" performance.

ABC-MART is a concrete implementation of ABC-Boost by using the negative multinomial log-likelihood
loss (6), the multi-class logistic model (9), and the paradigm of functional gradient boosting. Apparently, there
are other possible implementations of ABC-Boost. For example, one can implement an "ABC-Logitboost." This
study focuses on ABC-MART. Since MART has proved successful in large-scale industry applications,
demonstrating that ABC-MART may considerably improve MART will be appealing.

3.1 The Multi-class Logistic Model with a Fixed Base

We first derive some basic formulas needed for developing ABC-MART. Without loss of generality, we assume
class 0 is the base class.

Lemma 1 provides the derivatives of the class probabilities $p_{i,k}$ under the multi-class logistic
model (9) and the sum-to-zero constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$ [5, 4, 8].

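Lemma 1

$$\frac{\partial p_{i,k}}{\partial F_{i,k}} = p_{i,k}\left(1 + p_{i,0} - p_{i,k}\right), \quad k \neq 0, \qquad (14)$$

$$\frac{\partial p_{i,k}}{\partial F_{i,s}} = p_{i,k}\left(p_{i,0} - p_{i,s}\right), \quad k \neq s \neq 0, \qquad (15)$$

$$\frac{\partial p_{i,0}}{\partial F_{i,k}} = p_{i,0}\left(-1 + p_{i,0} - p_{i,k}\right), \quad k \neq 0. \qquad (16)$$

Proof: Note that $F_{i,0} = -\sum_{k=1}^{K-1} F_{i,k}$. Hence

$$p_{i,k} = \frac{e^{F_{i,k}}}{\sum_{s=0}^{K-1} e^{F_{i,s}}} = \frac{e^{F_{i,k}}}{\sum_{s=1}^{K-1} e^{F_{i,s}} + e^{-\sum_{s=1}^{K-1} F_{i,s}}},$$

$$\frac{\partial p_{i,k}}{\partial F_{i,k}} = \frac{e^{F_{i,k}}}{\sum_{s=0}^{K-1} e^{F_{i,s}}} - \frac{e^{F_{i,k}}\left(e^{F_{i,k}} - e^{F_{i,0}}\right)}{\left(\sum_{s=0}^{K-1} e^{F_{i,s}}\right)^2} = p_{i,k}\left(1 + p_{i,0} - p_{i,k}\right).$$

The other derivatives can be obtained similarly. □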
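The three formulas of Lemma 1 can be checked by finite differences once class 0 is eliminated through the sum-to-zero constraint. The sketch below is a numerical sanity check only; the function names and the example values are assumptions made for illustration.

```python
import math


def probs(F_free):
    """Class probabilities with class 0 as base: F_0 = -(F_1 + ... + F_{K-1})."""
    F = [-sum(F_free)] + list(F_free)
    m = max(F)
    e = [math.exp(f - m) for f in F]
    total = sum(e)
    return [v / total for v in e]


def lemma1_derivative(p, j, s):
    """Closed-form d p_j / d F_s for s >= 1, following (14) to (16)."""
    if j == s:
        return p[s] * (1.0 + p[0] - p[s])     # (14)
    if j == 0:
        return p[0] * (-1.0 + p[0] - p[s])    # (16)
    return p[j] * (p[0] - p[s])               # (15)


def numeric_derivative(F_free, j, s, h=1e-6):
    """Central finite difference of p_j with respect to the free variable F_s."""
    up = list(F_free)
    up[s - 1] += h
    dn = list(F_free)
    dn[s - 1] -= h
    return (probs(up)[j] - probs(dn)[j]) / (2.0 * h)


F_free = [0.3, -0.7, 1.1]     # hypothetical F_1, F_2, F_3, so K = 4
p = probs(F_free)
max_gap = max(
    abs(lemma1_derivative(p, j, s) - numeric_derivative(F_free, j, s))
    for j in range(4)
    for s in range(1, 4)
)
print(max_gap)                # small: closed forms agree with finite differences
```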



Lemma 1 helps derive the derivatives of the loss function, presented in Lemma 2. 
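Lemma 2

$$\frac{\partial L_i}{\partial F_{i,k}} = \left(r_{i,0} - p_{i,0}\right) - \left(r_{i,k} - p_{i,k}\right), \qquad (17)$$

$$\frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,0}\left(1 - p_{i,0}\right) + p_{i,k}\left(1 - p_{i,k}\right) + 2 p_{i,0}\, p_{i,k}. \qquad (18)$$

Proof:

$$L_i = -\sum_{s=1, s \neq k}^{K-1} r_{i,s} \log p_{i,s} - r_{i,k} \log p_{i,k} - r_{i,0} \log p_{i,0}.$$

Its first derivative is

$$\frac{\partial L_i}{\partial F_{i,k}} = -\sum_{s=1, s \neq k}^{K-1} \frac{r_{i,s}}{p_{i,s}} \frac{\partial p_{i,s}}{\partial F_{i,k}} - \frac{r_{i,k}}{p_{i,k}} \frac{\partial p_{i,k}}{\partial F_{i,k}} - \frac{r_{i,0}}{p_{i,0}} \frac{\partial p_{i,0}}{\partial F_{i,k}}$$

$$= \sum_{s=1, s \neq k}^{K-1} -r_{i,s}\left(p_{i,0} - p_{i,k}\right) - r_{i,k}\left(1 + p_{i,0} - p_{i,k}\right) - r_{i,0}\left(-1 + p_{i,0} - p_{i,k}\right)$$

$$= -\sum_{s=0}^{K-1} r_{i,s}\left(p_{i,0} - p_{i,k}\right) + r_{i,0} - r_{i,k}$$

$$= \left(r_{i,0} - p_{i,0}\right) - \left(r_{i,k} - p_{i,k}\right).$$

And the second derivative is

$$\frac{\partial^2 L_i}{\partial F_{i,k}^2} = -\frac{\partial p_{i,0}}{\partial F_{i,k}} + \frac{\partial p_{i,k}}{\partial F_{i,k}}$$

$$= -p_{i,0}\left(-1 + p_{i,0} - p_{i,k}\right) + p_{i,k}\left(1 + p_{i,0} - p_{i,k}\right)$$

$$= p_{i,0}\left(1 - p_{i,0}\right) + p_{i,k}\left(1 - p_{i,k}\right) + 2 p_{i,0}\, p_{i,k}. \qquad \square$$

Note that the first and second derivatives we derive differ from (11) and (12), which are the derivatives used
in Logitboost[5] and MART[4].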
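The derivatives (17) and (18) can likewise be compared against finite differences of the loss with class 0 fixed as the base. This is a sketch for verification only; the function names and the example values are hypothetical.

```python
import math


def probs(F_free):
    """Class probabilities with class 0 as base: F_0 = -(F_1 + ... + F_{K-1})."""
    F = [-sum(F_free)] + list(F_free)
    m = max(F)
    e = [math.exp(f - m) for f in F]
    total = sum(e)
    return [v / total for v in e]


def loss(F_free, y):
    """Negative log-likelihood of one observation with label y."""
    return -math.log(probs(F_free)[y])


def abc_derivatives(F_free, y, k):
    """Closed forms (17) and (18) for a non-base class k >= 1."""
    p = probs(F_free)
    r0 = 1.0 if y == 0 else 0.0
    rk = 1.0 if y == k else 0.0
    first = (r0 - p[0]) - (rk - p[k])
    second = p[0] * (1.0 - p[0]) + p[k] * (1.0 - p[k]) + 2.0 * p[0] * p[k]
    return first, second


def numeric_derivatives(F_free, y, k, h=1e-4):
    """Central finite differences in the free variable F_k."""
    up = list(F_free)
    up[k - 1] += h
    dn = list(F_free)
    dn[k - 1] -= h
    f0, fu, fd = loss(F_free, y), loss(up, y), loss(dn, y)
    return (fu - fd) / (2.0 * h), (fu - 2.0 * f0 + fd) / (h * h)


F_free = [0.3, -0.7, 1.1]     # hypothetical F_1, F_2, F_3, so K = 4
for y in range(4):
    for k in range(1, 4):
        print(y, k, abc_derivatives(F_free, y, k), numeric_derivatives(F_free, y, k))
```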

3.2 ABC-MART 

Alg. 3 provides the pseudo code for ABC-MART. Compared with MART (Alg. 2), we use different gradients
to build the trees and different second derivatives to update the values of terminal nodes. In addition, there is a
procedure (Line 10) for selecting the base class, denoted by $b$, which has the "worst" (i.e., largest) training loss
(6). At each boosting step, we only need to build $K - 1$ trees because the constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$ is enforced.

3.3 ABC-MART Recovers MART when K = 2 

Note that the factor $\frac{K-1}{K}$ does not appear in ABC-MART (Alg. 3). Interestingly, when $K = 2$, ABC-MART
recovers MART.

For example, consider $K = 2$, $r_{i,0} = 1$, $r_{i,1} = 0$, then

$$\frac{\partial L_i}{\partial F_{i,1}} = \left(1 - p_{i,0}\right) - \left(0 - p_{i,1}\right) = 2 p_{i,1},$$

$$\frac{\partial^2 L_i}{\partial F_{i,1}^2} = p_{i,0}\left(1 - p_{i,0}\right) + p_{i,1}\left(1 - p_{i,1}\right) + 2 p_{i,0}\, p_{i,1} = 4 p_{i,0}\, p_{i,1}.$$

In other words, the first (second) derivative is twice (four times) the first (second) derivative in MART. Using
the one-step Newton update (Line 5 in Alg. 3), the factor $\frac{1}{2}$ (which appeared in MART) is recovered. Note that
scaling the first derivatives does not affect the tree structures.
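The $K = 2$ equivalence can be checked on a single observation by comparing the two Newton ratios directly. The sketch below assumes a terminal node that contains one observation (with several observations, numerator and denominator scale by 2 and 4 in the same way); the function names are hypothetical.

```python
def mart_newton_step(r_k, p_k, K):
    """MART terminal-node value for one observation (Line 6 of Alg. 2)."""
    return (K - 1) / K * (r_k - p_k) / (p_k * (1.0 - p_k))


def abc_newton_step(r_b, p_b, r_k, p_k):
    """ABC-MART terminal-node value for one observation (Line 5 of Alg. 3)."""
    numerator = -(r_b - p_b) + (r_k - p_k)
    denominator = p_b * (1.0 - p_b) + p_k * (1.0 - p_k) + 2.0 * p_b * p_k
    return numerator / denominator


p1 = 0.3                 # hypothetical probability of class 1
p0 = 1.0 - p1            # K = 2, class 0 is the base
steps = []
for r1 in (0.0, 1.0):
    r0 = 1.0 - r1
    steps.append((mart_newton_step(r1, p1, 2), abc_newton_step(r0, p0, r1, p1)))
print(steps)             # the two entries of each pair coincide
```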
Algorithm 3 ABC-MART.

0: $r_{i,k} = 1$, if $y_i = k$, $r_{i,k} = 0$ otherwise. Choose a random base $b$.
1: $F_{i,k} = 0$, $p_{i,k} = \frac{1}{K}$, $k = 0$ to $K - 1$, $i = 1$ to $N$
2: For $m = 1$ to $M$ Do
3:   For $k = 0$ to $K - 1$, $k \neq b$, Do
4:     $\{R_{j,k,m}\}_{j=1}^{J}$ = $J$-terminal node regression tree from $\{-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k}), \mathbf{x}_i\}_{i=1}^{N}$
5:     $\beta_{j,k,m} = \frac{\sum_{\mathbf{x}_i \in R_{j,k,m}} -(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k})}{\sum_{\mathbf{x}_i \in R_{j,k,m}} p_{i,b}(1 - p_{i,b}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,b}\, p_{i,k}}$
6:     $F_{i,k} = F_{i,k} + \nu \sum_{j=1}^{J} \beta_{j,k,m} 1_{\mathbf{x}_i \in R_{j,k,m}}$
7:   End
8:   $F_{i,b} = -\sum_{k \neq b} F_{i,k}$, $i = 1$ to $N$.
9:   $p_{i,k} = \exp(F_{i,k}) / \sum_{s=0}^{K-1} \exp(F_{i,s})$, $k = 0$ to $K - 1$, $i = 1$ to $N$
10:  $b = \arg\max_k L^{(k)}$, where $L^{(k)} = \sum_{i=1}^{N} -\log(p_{i,k})\, 1_{y_i = k}$, $k = 0, 1, \ldots, K - 1$.
11: End
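The parts of Alg. 3 that differ from MART are the pseudo responses, the second-derivative weights, and the base class selection. The following sketch shows only those parts and deliberately omits the regression trees and the update of the function values; the function names `abc_responses` and `choose_base` and the toy probabilities are assumptions for illustration, not the author's code.

```python
import math


def abc_responses(y, P, b):
    """For every class k != b, return per-observation gradients and Hessian weights.

    y: list of labels; P: list of probability vectors; b: current base class.
    The gradients feed the tree in Line 4; both quantities feed Line 5."""
    K = len(P[0])
    out = {}
    for k in range(K):
        if k == b:
            continue
        grads, hess = [], []
        for yi, p in zip(y, P):
            r_b = 1.0 if yi == b else 0.0
            r_k = 1.0 if yi == k else 0.0
            grads.append(-(r_b - p[b]) + (r_k - p[k]))
            hess.append(p[b] * (1.0 - p[b]) + p[k] * (1.0 - p[k]) + 2.0 * p[b] * p[k])
        out[k] = (grads, hess)
    return out


def choose_base(y, P):
    """Line 10: pick the class with the largest per-class training loss."""
    K = len(P[0])
    losses = [0.0] * K
    for yi, p in zip(y, P):
        losses[yi] += -math.log(p[yi])
    return max(range(K), key=lambda k: losses[k])


# Toy example with K = 3 classes and four observations (hypothetical numbers).
y = [0, 1, 2, 2]
P = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.2, 0.2, 0.6]]
b = choose_base(y, P)
responses = abc_responses(y, P, b)
print(b, sorted(responses))
```

In this toy example class 2 accumulates the largest loss, so it becomes the base and only classes 0 and 1 would receive a tree in the next boosting step.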



3.4 MART Approximately "Averages" All Base Classes when $K \geq 3$

In a sense, MART did consider the averaging effect from all $K - 1$ base classes.
For the first derivatives, the following equality holds,

$$\sum_{b \neq k} \left\{-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k})\right\} = K\left(r_{i,k} - p_{i,k}\right), \qquad (20)$$

because, recalling that $\sum_{k=0}^{K-1} r_{i,k} = 1$ and $\sum_{k=0}^{K-1} p_{i,k} = 1$,

$$\sum_{b \neq k} \left\{-(r_{i,b} - p_{i,b}) + (r_{i,k} - p_{i,k})\right\}$$

$$= -1 + r_{i,k} + 1 - p_{i,k} + (K - 1)\left(r_{i,k} - p_{i,k}\right)$$

$$= K\left(r_{i,k} - p_{i,k}\right).$$

In other words, the gradient used in MART is the averaged gradient of ABC-MART.
Next, we can show that, for the second derivatives,

$$\sum_{b \neq k} \left\{(1 - p_{i,b})\, p_{i,b} + (1 - p_{i,k})\, p_{i,k} + 2 p_{i,b}\, p_{i,k}\right\} \geq (K + 2)(1 - p_{i,k})\, p_{i,k}, \qquad (21)$$

with equality holding only when $K = 2$, because

$$\sum_{b \neq k} \left\{(1 - p_{i,b})\, p_{i,b} + (1 - p_{i,k})\, p_{i,k} + 2 p_{i,b}\, p_{i,k}\right\}$$

$$= (K - 1)(1 - p_{i,k})\, p_{i,k} + 2 p_{i,k} \sum_{b \neq k} p_{i,b} + \sum_{b \neq k} (1 - p_{i,b})\, p_{i,b}$$

$$= (K + 1)(1 - p_{i,k})\, p_{i,k} + \sum_{b \neq k} p_{i,b} - \sum_{b \neq k} p_{i,b}^2$$

$$\geq (K + 1)(1 - p_{i,k})\, p_{i,k} + \sum_{b \neq k} p_{i,b} - \Big(\sum_{b \neq k} p_{i,b}\Big)^2$$

$$= (K + 1)(1 - p_{i,k})\, p_{i,k} + (1 - p_{i,k})\, p_{i,k}$$

$$= (K + 2)(1 - p_{i,k})\, p_{i,k}.$$

Thus, even the second derivative used in MART may be approximately viewed as the averaged second
derivatives in ABC-MART. It appears the factor $\frac{K-1}{K}$ in MART may be reasonably replaced by $\frac{K}{K+2}$ (both equal $\frac{1}{2}$
when $K = 2$). This, of course, will not make a real difference because the constant (either $\frac{K-1}{K}$ or $\frac{K}{K+2}$) can be
absorbed into the shrinkage factor $\nu$.
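The equality (20) and the inequality (21) can be illustrated with a small numerical example. The sketch below sums the ABC-MART derivatives over all base classes $b \neq k$ for one observation; the function name and the example probabilities are assumptions for illustration.

```python
def summed_abc_derivatives(r, p, k):
    """Sum the ABC-MART first and second derivatives of class k over all bases b != k."""
    K = len(p)
    grad_sum = sum(-(r[b] - p[b]) + (r[k] - p[k]) for b in range(K) if b != k)
    hess_sum = sum(
        p[b] * (1.0 - p[b]) + p[k] * (1.0 - p[k]) + 2.0 * p[b] * p[k]
        for b in range(K)
        if b != k
    )
    return grad_sum, hess_sum


p = [0.5, 0.3, 0.2]        # hypothetical class probabilities, K = 3
r = [0.0, 1.0, 0.0]        # the observation belongs to class 1
k = 1
K = len(p)
grad_sum, hess_sum = summed_abc_derivatives(r, p, k)
print(grad_sum, K * (r[k] - p[k]))                  # equal, as in (20)
print(hess_sum, (K + 2) * (1.0 - p[k]) * p[k])      # left side is larger, as in (21)
```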



4 Evaluations 

The goal of the evaluation study is to compare ABC-MART (Alg. 3) with MART (Alg. 2) for multi-class
classification. The experiments were conducted on one fairly large data set (Covertype) plus five small data sets
(Letter, Pendigits, Zipcode, Optdigits, and Isolet); see Table 1.



Table 1: Our experiments were based on six publicly available data sets. We randomly split the Covertype data
set into a training set and test set. For all other data sets, we used the standard (default) training and test sets.

Data set     K    # training samples   # test samples   # features
Covertype    7    290506               290506           54
Letter       26   16000                4000             16
Pendigits    10   7494                 3498             16
Zipcode      10   7291                 2007             256
Optdigits    10   3823                 1797             64
Isolet       26   6218                 1559             617

In general, a comprehensive and fair comparison of two classification algorithms is a non-trivial task. In our 
case, however, the comparison task appears quite easy because ABC-MART and MART differ only in a few lines 
of code and their underlying base learners can be completely identical. 

Ideally, we hope that ABC-MART will improve MART, for every set of reasonable parameters, $J$ and $\nu$. For
the five small data sets, we experiment with every combination of number of terminal nodes $J$ and shrinkage $\nu$,
where $J$ and $\nu$ are chosen from

$$J \in \{4, 6, 8, 10, 12, 14, 16\}, \qquad \nu \in \{0.04, 0.06, 0.08, 0.1\}.$$

For the five small data sets, we let the number of boosting steps $M = 10000$. However, the experiments
usually terminated well before $M = 10000$ because the machine accuracy was reached.

For the Covertype data set, since it is fairly large, we only experimented with $J = 6, 10, 20$ and $\nu = 0.04, 0.1$
and we limited $M$ to be 16000, 11500, 6000, for $J = 6, 10, 20$, respectively.

4.1 Summary of Experiment Results 

The test mis-classification error is a direct measure of performance. MART and ABC-MART output $K$ class
probabilities $p_{i,k}$, $k = 0$ to $K - 1$, for each observation $i$. To obtain the class labels, we adopt the commonly used
rule

$$k' = \arg\max_{k} p_{i,k}. \qquad (22)$$

We define $R_{err}$, the "relative improvement of test mis-classification errors," as

$$R_{err} = \frac{\text{mis-classification errors of MART} - \text{mis-classification errors of ABC-MART}}{\text{mis-classification errors of MART}}.$$

Since we experimented with a series of parameters, $J$, $\nu$, and $M$, we report the overall "best" (i.e., smallest)
mis-classification errors in Table 2. Later, we will also report the more detailed mis-classification errors for every
combination of $J$ and $\nu$ in Sections 4.2 to 4.7.
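As a small worked example of the relative improvement, the sketch below evaluates $R_{err}$ as a percentage (the form in which the tables report it) for three of the error counts listed in Table 2. The function name is hypothetical.

```python
def relative_improvement(errors_mart, errors_abc):
    """Relative improvement of test mis-classification errors, in percent."""
    return 100.0 * (errors_mart - errors_abc) / errors_mart


# Error counts taken from Table 2.
covertype = relative_improvement(11133, 10203)
letter = relative_improvement(135, 111)
optdigits = relative_improvement(56, 41)
print(round(covertype, 1), round(letter, 1), round(optdigits, 1))
```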



² All data sets are publicly available. The Zipcode data set is downloaded from
http://www-stat.stanford.edu/~tibs/ElemStatLearn/data.html and all other data sets can be found from the UCI repository.
Table 2: Summary of test mis-classification errors.

Data set     MART    ABC-MART   R_err (%)   P-value
Covertype    11133   10203      8.4         4.4 x 10
Letter       135     111        17.8        0.060
Pendigits    123     104        15.5        0.100
Zipcode      111     98         11.7        0.178
Optdigits    56      41         26.8        0.061
Isolet       84      69         17.9        0.107


Table 2 also provides the "P-value" of the one-sided t-test. The idea is to model the test error rate (i.e., the 
test mis-classification errors divided by the number of test samples) as a binomial probability and then conduct 
the t-test using the normal approximation of the difference of two binomial probabilities. We can see that for the 
Covertype data set, the improvement of ABC-MART over MART is highly significant (the P-value is nearly zero) 
under this test. For the five small data sets, the P-values are also reasonably small. 
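The test described above can be sketched as follows. The text does not spell out the variance estimate, so the unpooled form used here is an assumption; the function name is hypothetical. Under this assumption the Letter entry of Table 2 is reproduced to three decimals.

```python
import math


def one_sided_p_value(errors_a, errors_b, n_test):
    """One-sided test that algorithm B has a lower error rate than algorithm A.

    Uses the normal approximation to the difference of two binomial proportions
    with an unpooled variance estimate (an assumption, see the note above)."""
    p_a = errors_a / n_test
    p_b = errors_b / n_test
    se = math.sqrt(p_a * (1.0 - p_a) / n_test + p_b * (1.0 - p_b) / n_test)
    z = (p_a - p_b) / se
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Letter: 135 (MART) vs. 111 (ABC-MART) errors on 4000 test samples.
p_letter = one_sided_p_value(135, 111, 4000)
# Covertype: 11133 vs. 10203 errors on 290506 test samples.
p_covertype = one_sided_p_value(11133, 10203, 290506)
print(p_letter, p_covertype)
```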

We shall mention that this t-test is very stringent when the error rate is small. In fact, we do not often see 
papers which calculated the P-values when comparing different classification algorithms. 

Next, we present the detailed experiment results on the six data sets. 

4.2 Experiments on the Covertype Data Set 

Table 3 summarizes the test mis-classification errors along with the relative improvements ($R_{err}$), for every
combination of $J \in \{6, 10, 20\}$ and $\nu \in \{0.04, 0.1\}$. For each $J$ and $\nu$, the smallest test mis-classification
errors, separately for ABC-MART and MART, are the lowest points in the curves in Figure 2.

Table 3: Covertype. The test mis-classification errors. The corresponding relative improvements ($R_{err}$, %) of
ABC-MART are included in parentheses.

(a) MART

          ν = 0.04   ν = 0.1
J = 6     20756      15959
J = 10    15862      12553
J = 20    13630      11133

(b) ABC-MART

          ν = 0.04       ν = 0.1
J = 6     17185 (17.2)   14230 (10.8)
J = 10    13064 (17.6)   11487 (8.5)
J = 20    11595 (14.9)   10203 (8.4)



To report the experiments in a more informative manner, Figure 1, Figure 2, and Figure 3, respectively, present 
the training loss, the test mis-classification errors, and the relative improvements, for the complete history of M 
boosting steps (iterations). 

Figure 1 indicates that ABC-MART reduces the training loss (6) considerably and consistently faster than 
MART. Figure 2 demonstrates that ABC-MART exhibits considerably and consistently smaller test mis-classification 
errors than MART. 

Figure 3 illustrates that the relative improvements of ABC-MART over MART, in terms of the mis-classification 
errors, may be considerably larger than the numbers reported in Table 3 if we stop the training earlier. This phe- 
nomenon may be quite beneficial for real-world large-scale applications when either the training or test time is 
part of the performance measure. For example, real-time applications (e.g., search engines) may not be able to 
afford to use models with a very large number of boosting steps. 
[Figure 1 panels. Horizontal axis: Iterations. Panel title: Train: J = 6, ν = 0.04. Curve label: ABC.]

Figure 1: Covertype. The training loss, i.e., (6). The curves labeled "ABC" correspond to ABC-MART. 
The next five subsections are respectively devoted to presenting the experiment results of five small data sets.
The results exhibit similar characteristics across data sets.

4.3 Experiments on the Letter Data Set 

Table 4 summarizes the test mis-classification errors along with the relative improvements for every combination
of $J \in \{4, 6, 8, 10, 12, 14, 16\}$ and $\nu \in \{0.04, 0.06, 0.08, 0.1\}$.

Table 4: Letter. The test mis-classification errors. The corresponding relative improvements ($R_{err}$, %) of
ABC-MART are included in parentheses.

(a) MART

          ν = 0.04   ν = 0.06   ν = 0.08   ν = 0.1
J = 4     176        178        177        173
J = 6     154        160        156        158
J = 8     151        145        151        154
J = 10    141        141        147        144
J = 12    150        144        144        140
J = 14    143        147        144        146
J = 16    138        147        142        135

(b) ABC-MART

          ν = 0.04     ν = 0.06     ν = 0.08     ν = 0.1
J = 4     155 (11.9)   143 (19.7)   148 (16.4)   144 (16.8)
J = 6     140 (9.1)    141 (11.9)   ...          ...
Figure 4 again indicates that ABC-MART reduces the training loss (6) considerably and consistently faster
than MART. Since this is a small data set, the training loss could approach zero even when the number of boosting
steps is not too large.

Figure 5 again demonstrates that ABC-MART exhibits considerably and consistently smaller test mis-classification
errors than MART.

Figure 4: Letter. The training loss, i.e., (6). The curves labeled "ABC" correspond to ABC-MART.

Figure 5: Letter. The test mis-classification errors.
4.4 Experiments on the Pendigits Data Set

Table 5: Pendigits. The test mis-classification errors. The corresponding relative improvements ($R_{err}$, %) of
ABC-MART are included in parentheses.

(a) MART

          ν = 0.04   ν = 0.06   ν = 0.08   ν = 0.1
J = 4     144        145        142        148
J = 6     137        134        136        135
J = 8     133        134        132        128
J = 10    123        131        127        130
J = 12    135        136        134        133
J = 14    129        131        130        133
J = 16    132        129        130        134

(b) ABC-MART

          ν = 0.04     ν = 0.06     ν = 0.08     ν = 0.1
J = 4     112 (22.2)   109 (24.8)   110 (22.5)   112 (24.3)
J = 6     114 (16.8)   114 (14.9)   111 (18.4)   109 (19.3)
J = 8     112 (15.8)   107 (20.1)   111 (15.9)   105 (18.0)
J = 10    105 (14.6)   ...          ...          ...
J = 12    ...          ...          ...          ...
J = 14    107 (17.1)   106 (19.1)   ...          107 (19.5)
J = 16    106 (19.7)   107 (17.1)   106 (18.5)   107 (20.1)
Figure 6: Pendigits. The test mis-classification errors.

4.5 Experiments on the Zipcode Data Set

Table 6: Zipcode. The test mis-classification errors.

(a) MART

          ν = 0.04   ν = 0.06   ν = 0.08   ν = 0.1
J = 4     130        126        129        127
J = 6     123        122        123        126
J = 8     120        122        121        123
J = 10    118        118        119        118
J = 12    117        117        116        118
J = 14    119        120        119        117
J = 16    118        111        116        116

(b) ABC-MART

          ν = 0.04     ν = 0.06     ν = 0.08     ν = 0.1
J = 4     116 (10.8)   114 (9.5)    116 (10.1)   111 (12.6)
J = 6     110 (10.6)   111 (9.0)    109 (11.4)   108 (14.3)
J = 8     109 (9.2)    102 (16.4)   ...          106 (13.8)
J = 10    ... 104 (11.9) ...
J = 12    ...          ...          103 (11.2)   98 (16.9)
J = 14    ...          104 (13.3)   103 (13.4)   105 (10.3)
J = 16    104 (11.9)   103 (7.2)    102 (12.1)   104 (10.3)
Figure 7: Zipcode. The test mis-classification errors.

4.6 Experiments on the Optdigits Data Set

Table 7: Optdigits. The test mis-classification errors.

(a) MART

          ν = 0.04   ν = 0.06   ν = 0.08   ν = 0.1
J = 4     58         57         57         59
J = 6     60         57         59         57
J = 8     60         60         57         61
J = 10    59         59         58         59
J = 12    57         58         56         61
J = 14    57         59         58         56
J = 16    59         58         59         57

(b) ABC-MART

          ν = 0.04    ν = 0.06    ν = 0.08    ν = 0.1
J = 4     45 (22.4)   41 (28.1)   44 (22.8)   42 (28.8)
J = 6     47 (21.7)   48 (15.8)   47 (20.3)   45 (21.1)
J = 8     49 (18.3)   49 (18.3)   49 (14.0)   46 (24.6)
J = 10    55 (6.8)    54 (8.5)    50 (13.8)   49 (16.9)
J = 12    56 (1.8)    55 (5.2)    50 (10.7)   51 (16.4)
J = 14    55 (3.5)    53 (10.2)   50 (13.8)   47 (16.1)
J = 16    57 (3.4)    ...         53 (10.2)   53 (7.0)
Figure 8: Optdigits. The test mis-classification errors.

4.7 Experiments on the Isolet Data Set

Table 8: Isolet. The test mis-classification errors.

(a) MART

          ν = 0.04   ν = 0.06   ν = 0.08   ν = 0.1
J = 4     84         84         85         84
J = 6     84         84         85         85
J = 8     87         89         86         85
J = 10    88         87         84         84
J = 12    88         91         85         86
J = 14    94         93         90         91
J = 16    93         91         90         88

(b) ABC-MART

          ν = 0.04    ν = 0.06    ν = 0.08    ν = 0.1
J = 4     74 (11.9)   74 (11.9)   74 (12.9)   72 (14.3)
J = 6     76 (9.5)    71 (15.5)   69 (18.8)   71 (16.5)
J = 8     72 (17.2)   74 (16.9)   73 (15.1)   73 (14.1)
J = 10    72 (18.2)   76 (12.6)   75 (10.7)   71 (15.5)
J = 12    78 (11.4)   74 (18.7)   75 (11.8)   72 (16.3)
J = 14    74 (21.3)   72 (22.6)   71 (21.1)   79 (13.2)
J = 16    79 (15.1)   75 (17.6)   76 (15.6)   79 (10.2)
Figure 9: Isolet. The test mis-classification errors.

5 Discussion



Retrospectively, the ideas behind ABC-Boost and ABC-MART appear simple and intuitive. Our experiments in 
Section 4 have demonstrated the effectiveness of ABC-MART and its considerable improvements over MART. 

The two key components of ABC-Boost are: (1) developing boosting algorithms by assuming one base class; 
(2) adaptively choosing the base class so that, at each boosting step, the "worst" class will be selected. We believe 
both components contribute critically to the good performance of ABC-MART. 

Note that assuming the sum-to-zero constraint on the loss function and the base class is a ubiquitously adopted 
strategy[5, 4, 8]. Our contribution in this part is the different set of derivatives of the loss function (6), compared 
with the classical work[5, 4]. 

One may ask two questions. 1: Can we use the MART derivatives (11) and (12) and adaptively choose the 
base? 2: Can we use ABC-MART derivatives and a fixed base? Neither will achieve a good performance. 

To demonstrate this point, we consider three alternative boosting algorithms and present their training and test 
results using the Pendigits data set (K = 10). 

1. MART derivatives + adaptively choosing the worst base. It is the same as ABC-MART except it uses the 
derivatives of MART. In Figure 10, we label the corresponding curves by "Mb." 

2. ABC-MART derivatives + a fixed base class chosen according to MART training loss. In the experiment
with MART, we find "class 1" exhibits the overall largest training loss. Thus, we fix "class 1" as the base
and re-train using the derivatives of ABC-MART. In Figure 10, we label the corresponding curves by "b1."

3. ABC-MART derivatives + a fixed base class chosen according to MART test mis-classification errors. In 
the experiment with MART, we find "class 7" exhibits the overall largest error. We fix "class 7" as the base 
and re-train using the derivatives of ABC-MART. In Figure 10, we label the corresponding curves by "b7." 

Figure 10 demonstrates that none of the alternative boosting algorithms could outperform ABC-MART.

Figure 10: Pendigits. The training loss and test mis-classification errors of three alternative boosting algorithms
(labeled by "Mb", "b1", and "b7"), together with the results of ABC-MART and MART (labeled by "M").



6 Conclusion 

We present the general idea of ABC-Boost and its concrete implementation named ABC-MART, for multi-class
classification (with $K \geq 3$ classes). Two key components of ABC-Boost include: (1) By enforcing the (commonly
used) constraint on the loss function, we can derive boosting algorithms for $K - 1$ classes using a base class; (2)
We adaptively choose the current "worst" class as the base class, at each boosting step. Both components are
critical. Our experiments on a fairly large data set and five small data sets demonstrate that ABC-MART can
considerably improve the original MART algorithm, which has already been highly successful in large-scale
industry applications.